ABSTRACT

Four different systematic methods for selecting passing scores, which
differ primarily in the types of judgment they require, were compared.
The borderline group method and the
contrasting groups method were each compared with the Nedelsky method
at four schools and the Angoff method at another four schools. The
Basic Skills Assessment Tests in reading and mathematics were
administered for the study. The project was designed to determine
whether different methods yield similar passing scores and, if not,
whether the differences between them are systematic and predictable.
The Nedelsky and Angoff methods are based upon judgments about test
items. The borderline and contrasting group methods produce similar
results when approximately equal numbers of students are classified
as masters and non-masters. The contrasting group passing score
produced different results when the ratio of masters to non-masters
fluctuated. The Nedelsky and Angoff methods produced inconsistent
results across schools. The passing scores were higher at schools
with more able students. Results of the study suggest that those who
set passing scores should use methods based upon test scores of
actual test takers whenever possible. (Author/SWH)
Abstract

The borderline-group method and the contrasting-groups method were each
compared with Nedelsky's method at four schools and with Angoff's method at
another four schools, using tests of basic skills in reading and mathematics.
The borderline-group and contrasting-groups methods produced similar results
when approximately equal numbers of students were classified as masters and
nonmasters. The contrasting-groups passing score was lower than the
borderline-group passing score when masters greatly outnumbered nonmasters;
higher when nonmasters outnumbered masters. Results involving the Nedelsky
and Angoff methods were not consistent across schools. Passing scores tended
to be higher at schools where students were more able.
A Comparative Study of Standard-Setting Methods

A passing score on a test represents an answer to the question, "How much
is enough?" The passing score indicates the level of knowledge or skill that
will be considered sufficient for some purpose. Any method of choosing the
passing score requires judgment at some stage of the process. Several
different systematic methods for choosing passing scores have been suggested.
(See Shepard, 1980, for a review.) These methods differ primarily in the
kinds of judgment they require.

The purpose of the present study was to compare four of these methods for 
choosing a passing score, in an attempt to answer the following questions: 

1. When these different methods are applied to the same test, with the
same persons making the judgments, do they yield similar passing
scores?

2. To the extent that the passing scores differ from one method to 
another, are these differences systematic and predictable? 

The four methods investigated were "Nedelsky's method" (Nedelsky, 1954),
"Angoff's method" (Angoff, 1971), the "borderline-group method", and the
"contrasting-groups method". (For a detailed description of these methods,
see Livingston and Zieky, 1982.) The first two of these methods are based on
judgments about the questions on the test. The judges are asked to envision
the way a "borderline" test-taker would respond to each question on the test.
A borderline test-taker is one whose level of the knowledge or skills measured
by the test is on the borderline between sufficient and insufficient.

In Nedelsky's method (which can be used only with multiple-choice tests),
the judges are asked to decide which of the wrong answer-choices the
borderline test-taker could identify as not being the correct answer. These
judgments are used to estimate, for each test question, the probability that a
borderline test-taker would choose the correct answer. These probabilities
are then summed, to yield the expected score for a borderline test-taker - a
reasonable choice for the passing score. Angoff's method is similar to
Nedelsky's, except that the judges are asked to specify the probabilities
directly.

¹Angoff (personal communication, 1983) attributes this method to Ledyard R
Tucker.
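The two item-judgment computations described above can be sketched as follows. This is a minimal illustration, not the study's own scoring code. The function names and sample judgments are hypothetical, and the Nedelsky probability is taken to be one divided by the number of answer choices the borderline test-taker cannot rule out, which is the customary reading of that method.

```python
def nedelsky_passing_score(eliminated_counts, n_options=4):
    """Sum, over items, of the chance that a borderline test-taker answers correctly.

    eliminated_counts[i] is the number of wrong choices the judge believes
    the borderline test-taker could rule out on item i. The test-taker is
    assumed to guess at random among the choices that remain.
    """
    total = 0.0
    for eliminated in eliminated_counts:
        if not 0 <= eliminated <= n_options - 1:
            raise ValueError("eliminated count must lie between 0 and n_options - 1")
        remaining = n_options - eliminated  # choices still in play, including the key
        total += 1.0 / remaining
    return total


def angoff_passing_score(probabilities):
    """Sum of the probabilities that the judge specified directly for each item."""
    return sum(probabilities)


# Hypothetical judgments for a four-item test (not data from the study).
print(nedelsky_passing_score([0, 1, 2, 3]))
print(angoff_passing_score([0.5, 0.25, 1.0]))
```

In both cases the result is the expected number-right score of a borderline test-taker; the methods differ only in how each item probability is obtained.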

The borderline-group method follows the same logic as Nedelsky's and
Angoff's methods. However, instead of making judgments about each question,
the judges nominate specific individual test-takers as having a "borderline"
level of the knowledge or skills the test measures. The score that is typical
of these "borderline" students' performance on the test - usually the median -
is taken as the passing score.

The contrasting-groups method is a bit more complex. The judges classify
individual test-takers as "masters" (i.e., those with a sufficient level of
knowledge or skill) or "nonmasters" (i.e., those with an insufficient level of
knowledge or skill). The passing score is usually chosen to minimize the
number of wrong decisions (i.e., failing a "master" or passing a "nonmaster").
The passing score can also be chosen to minimize a weighted sum of the two
types of wrong decisions. However, in this study, we have used the passing
score that weights the two types of wrong decisions equally.
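The decision rule just described can be illustrated with a direct count of wrong decisions at each candidate passing score. This is only a sketch on raw counts; the study itself worked from a smoothed conditional probability function. The function name, the weight parameters, and the sample scores are hypothetical.

```python
def min_error_passing_score(master_scores, nonmaster_scores,
                            w_fail_master=1.0, w_pass_nonmaster=1.0):
    """Return (passing_score, cost) minimizing the weighted count of wrong decisions.

    A student passes when score >= passing_score. With both weights equal
    to 1, the cost is simply the number of misclassified students.
    """
    candidates = sorted(set(master_scores) | set(nonmaster_scores))
    candidates.append(candidates[-1] + 1)  # a cut above every score fails everyone
    best_cut, best_cost = None, None
    for cut in candidates:
        failed_masters = sum(1 for s in master_scores if s < cut)
        passed_nonmasters = sum(1 for s in nonmaster_scores if s >= cut)
        cost = w_fail_master * failed_masters + w_pass_nonmaster * passed_nonmasters
        if best_cost is None or cost < best_cost:  # ties keep the lower cut
            best_cut, best_cost = cut, cost
    return best_cut, best_cost


# Hypothetical scores (not data from the study).
masters = [40, 45, 50, 55]
nonmasters = [20, 25, 30, 42]
print(min_error_passing_score(masters, nonmasters))
```

Raising `w_fail_master` relative to `w_pass_nonmaster` pushes the chosen score downward, which corresponds to the weighted variant mentioned in the text.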

Although several previous studies have compared passing scores produced by
different standard-setting methods, only a few have compared methods based on
judgments of items with methods based on judgments of actual test-takers.
Koffler (1980) compared Nedelsky's method with the contrasting-groups method,
for eight different tests, with results that varied considerably from one test
to another. Poggio, Glasnapp, and Eros (1981) found systematic differences
between methods: Ebel's method produced the highest passing score, followed
by Angoff's method, the contrasting-groups method, and Nedelsky's method in
that order. Mills and Barr (1983) found that both Angoff's method and Ebel's
method consistently produced higher passing scores than did the
contrasting-groups method. Taken together, these previous findings suggest
that there may be systematic differences between methods. If so, the results
of the present study should reflect those differences.

Method 

The tests for this study were the Basic Skills Assessment Tests in reading 
and mathematics, developed by Educational Testing Service. Both tests are 
made up of four-option multiple-choice questions. The reading test contains 
65 questions; the math test contains 70. These tests are intended to test the 
basic reading and math skills required in the daily life of an American adult. 
For example, the reading test includes excerpts from a medicine bottle label, 
a newspaper want-ad section, a road map, and the yellow pages of a telephone 
directory. The math test includes problems in the four basic arithmetic 
operations and applications such as comparing unit prices, adding sales tax to
a restaurant check, etc. "Mastery" was defined, for the purpose of this
study, as the ability to perform adequately the reading/mathematical tasks of 
adult life in modern American society. These tasks were not specified or 
enumerated. 

The judges for the study were teachers of students in grades 6, 7, and 8.
We chose these grade levels in the hope of finding both a substantial number
of students who had achieved mastery and a substantial number who had not.
The judges for the reading test were teachers of English, reading, or language
arts. The judges for the math test were math teachers. (In one school, two
science teachers also served as judges for the math test.)

Eight schools participated in the study, each one from a different school
district.* These eight school districts represented a wide range of
socioeconomic conditions. In each school, from three to five teachers served
as judges for each test. The schools included various combinations of grade
levels (e.g., K-8, 6-8, 7-12).

The experimental design called for the teachers in each school to make
judgments for three standard-setting methods: the borderline-group method,
the contrasting-groups method, and either Nedelsky's method or Angoff's
method. In four schools the teachers made judgments of their students before
making the Angoff/Nedelsky judgments; in the other four schools the order was
reversed. The resulting design, including the grade levels of the students
participating, is shown in Table 1.

In the schools where the teachers judged the questions first, the
researchers met with the teachers only once, for approximately two hours.
They began by explaining the purpose of the study. Next they explained the
way in which individual students were to be judged as "masters", "nonmasters",
or "borderline", emphasizing that these judgments were to be based on the
reading or math skills necessary to function as an adult in American society.
("If this student had to do all the reading/mathematics for his/her family,
could he/she do it adequately?") The form on which the teachers recorded
their judgments of the students' skills also had a "cannot judge" category for
students whose level of skill the teacher was unsure of. The researchers
distributed these forms and the test materials. Then one researcher led the
math teachers in an Angoff or Nedelsky standard-setting session, while the
other researcher did the same with the reading, English, or language arts
teachers. After the meeting the teachers administered the tests to their
students and returned the completed answer sheets by mail.

* One of the eight schools was, in fact, two schools, located in the same
complex of buildings but administratively separate. These two schools
participated together in the study, and the teachers from the two schools met
together for the standard-setting sessions. In this report they will be
treated as one school.

In the schools where the teachers judged the students first, the
researchers met with the teachers twice. At the first meeting, one of the
researchers explained the purpose of the study, described the procedure for
judging the students, and distributed the judgment forms and test materials.
The researcher asked the teachers to look through the tests to see what kinds
of skills the tests measured, but to use their own ideas of the skills
required in daily adult life as the basis for all judgments involved in the
study. At the second meeting, approximately a week later, the researchers
collected the judgment forms and conducted the Angoff or Nedelsky
standard-setting session.

For the Angoff standard-setting sessions the teachers were given copies of
the test with the correct answers indicated and a form for recording their
judgments. The leader (i.e., the researcher leading the session) began by
explaining the logic of the method and reviewing the definition of mastery and
of the "borderline" student. Next, the leader asked the teachers to make
judgments for a number of selected questions, comparing and discussing their
judgments for each question. The teachers then worked independently, making a
preliminary judgment for each question. After approximately half an hour, the
leader polled the group for their judgments on each of the questions they had
finished judging. Whenever there was a discrepancy of 20 or more percentage
points between any two teachers, the leader asked the teachers with the
highest and lowest judgments to explain their reasons. All the teachers were
then given the opportunity to change their judgments if they wished to do so.
This procedure was repeated for the remaining questions on the test.

The procedure for the Nedelsky standard-setting sessions was similar to
that for the Angoff sessions, except that the correct answers to the test
questions were indicated on the form the teachers used to record their
judgments. The session leader did not have a fixed rule for deciding whether
or not to ask the teachers to discuss their responses. Generally, the leader
would call for discussion whenever one teacher eliminated all three wrong
answers and another teacher did not, or when one teacher eliminated two of the
three wrong answers and another teacher did not eliminate any.

The passing score for the borderline-group method was set at the median
test score of those students classified as "borderline". In those cases where
the borderline group contained fewer than four students, no borderline-group
passing score will be reported.
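The borderline-group rule stated above amounts to a median with a minimum group size. A minimal sketch follows; the function name and the sample scores are hypothetical, and `None` is used here simply to stand for "no passing score reported".

```python
from statistics import median

MIN_GROUP_SIZE = 4  # groups smaller than this yield no reported passing score


def borderline_passing_score(borderline_scores, min_group_size=MIN_GROUP_SIZE):
    """Median test score of the students judged "borderline", or None if too few."""
    if len(borderline_scores) < min_group_size:
        return None
    return median(borderline_scores)


# Hypothetical scores (not data from the study).
print(borderline_passing_score([30, 42, 38, 35, 40]))
print(borderline_passing_score([30, 40, 50]))
```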

The passing score for the contrasting-groups method was computed by
estimating a conditional probability function: the probability that a student
from the combined group of masters and nonmasters with a given test score
would be classified as a master. (The estimation procedure is described
briefly in the Appendix to this report.) If the contrasting-groups method
works as it should, this probability will increase with the student's test
score. The passing score was set at the test score for which this estimated
probability was equal to .50. In those cases where either group (masters or
nonmasters) contained fewer than four students, no contrasting-groups passing
score will be reported. (Ability-grouping in some of the schools led to this
situation for some of the teachers.) Also, in those few cases where the test
scores of the "masters" were no higher than those of the "nonmasters", no
contrasting-groups passing score will be reported.

The passing scores for the Nedelsky and Angoff methods were the sum of the 
probabilities for all test items, i.e., the expected score for a borderline 
test-taker, as computed from the judgments. 

In computing the passing scores for each school, the data were combined
across teachers. For the borderline-group method, all the students judged
"borderline" were combined into a single borderline group for the school.
This procedure, by giving each student equal weight, tends to give a heavier
weight to teachers who placed more students in the borderline group. A
similar procedure was used for the contrasting-groups method. For the
Nedelsky and Angoff methods, the passing scores for the individual teachers
were averaged by taking a simple mean, weighting each teacher equally.
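The difference between the two ways of combining teachers can be seen in a small sketch. The function names, teacher labels, and numbers are hypothetical; the point is only that pooling students weights each teacher by the size of that teacher's group, while the simple mean weights each teacher equally.

```python
from statistics import mean, median


def pooled_borderline_score(scores_by_teacher):
    """Median over all borderline students in the school, pooled across teachers."""
    pooled = [s for scores in scores_by_teacher.values() for s in scores]
    return median(pooled)


def school_item_based_score(passing_score_by_teacher):
    """Simple mean of the teachers' Nedelsky or Angoff passing scores."""
    return mean(passing_score_by_teacher.values())


# Hypothetical data: teacher T1 nominated five borderline students, T2 only one.
borderline = {"T1": [30, 32, 34, 36, 38], "T2": [50]}
print(pooled_borderline_score(borderline))  # dominated by T1's students

# Hypothetical item-based passing scores, one per teacher.
print(school_item_based_score({"T1": 40.0, "T2": 44.0, "T3": 45.0}))
```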
Results

Figures 1a and 1b show the passing scores for each school, as determined
by each method tried at the school. On the graph for each school the small
circle represents the students' mean test score; the vertical line extends one
standard deviation above and below the mean. (Table A1 in the Appendix
presents this information in numerical terms.) In some schools all three
methods tried at the school produced similar passing scores; in other schools
the three methods produced very different results. None of the methods
consistently produced results similar to those of any other method.

All four methods produced passing scores that varied considerably from
school to school. The contrasting-groups method showed the largest variation,
producing both the highest and lowest passing scores on the math test as well
as the lowest on the reading test. The borderline-group method tended to
produce higher passing scores than the other methods on the reading test but
not on the math test. It is difficult to generalize about the Nedelsky and
Angoff methods on the basis of only four schools. The Nedelsky method
produced low passing scores for both reading and math at Schools 1, 2, and 4,
but high passing scores on both tests at School 3. The Angoff method tended
to produce low passing scores on the reading test but not on the math test.
In general, the differences between the results of the different methods were
not consistent across schools.

Table 2 shows the failure rates that would have resulted from each of the
passing scores at each school. Most of the differences between methods are
substantial, and some are extremely large. The math test at School 3 shows
the largest differences; the contrasting-groups passing score would have
failed only eight percent of the students, while the Nedelsky passing score
would have failed 91 percent.

What happens when the standard-setting methods are applied separately for
each individual teacher? The results of this analysis are shown in Figures
2a-2d. The passing scores for any particular method tend to be similar for
teachers of the same subject in the same school, although there are
exceptions. (For the Nedelsky and Angoff methods this similarity may be partly
a result of the group discussion included in the procedure.)

Is it reasonable to expect the contrasting-groups method to produce a
passing score similar to those produced by any of the other methods?
Nedelsky's method, Angoff's method, and the borderline-group method are all
based on the idea that the passing score should be the score that is typical
of "borderline" test-takers. The choice of a passing score in the
contrasting-groups method, as applied in this study, was based on a different
rationale - that of minimizing the number of misclassifications in a
particular population of students. The contrasting-groups passing score
depends not only on the test scores of the masters and nonmasters, but also on
the number of students classified into each group. Where most of the students
are masters, the masters may outnumber the nonmasters even at very low test
score levels. As a result, the passing score will tend to be low (as compared
with the passing scores set by other methods). Where most of the students are
nonmasters, the passing score will tend to be high.
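The dependence on group sizes can be made concrete under a simplifying assumption that is not part of the study: suppose the scores of masters and nonmasters are each normally distributed with a common standard deviation. The score at which a student is equally likely to be a master or a nonmaster then has a closed form, sketched below with hypothetical means, standard deviation, and group sizes.

```python
import math


def equal_probability_score(mean_m, mean_n, sd, n_masters, n_nonmasters):
    """Score at which a student is equally likely to be a master or a nonmaster.

    Illustrative assumption: both groups have normally distributed scores
    with a common standard deviation `sd`, and mean_m > mean_n.
    """
    midpoint = (mean_m + mean_n) / 2
    # The log of the group-size ratio moves the cut away from the midpoint:
    # downward when masters are more numerous, upward when nonmasters are.
    shift = sd ** 2 * math.log(n_masters / n_nonmasters) / (mean_m - mean_n)
    return midpoint - shift


# Hypothetical values, not data from the study.
for n_m, n_n in [(50, 50), (100, 20), (20, 100)]:
    print(n_m, n_n, round(equal_probability_score(50, 40, 8, n_m, n_n), 1))
```

With equal group sizes the score falls midway between the two group means; with unequal sizes it moves toward the mean of the smaller group, in the direction the paragraph above describes.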
What happens when we look only at the schools in which the numbers of
masters and nonmasters were approximately equal? There were six cases in
which the size of the smaller group (masters or nonmasters) was at least 75
percent of the size of the larger: the reading test in Schools 1, 3, and 6,
and the math test in Schools 5, 6, and 7. In one of these cases no
contrasting-groups passing score could be computed. In the remaining five
cases, the borderline-group passing score and the contrasting-groups passing
score tended to be close to each other. The differences between these two
passing scores were 3.1 and 0.8 points on the 65-point reading test and 1.8,
5.8, and 0.6 points on the 70-point math test. It seems reasonable to
conclude that the contrasting-groups method and the borderline-group method
will tend to produce similar passing scores when approximately equal numbers
of students are judged as masters and nonmasters.

This result is corroborated by the results in the cases where one group
was much larger than the other. The masters greatly outnumbered the
nonmasters (by at least 5 to 1) on the reading test in Schools 2, 4, 7, and 8.
In each of these schools the contrasting-groups passing score was far below
the borderline-group passing score. On the math test, the masters outnumbered
the nonmasters by at least 2.5 to 1 at Schools 3, 4, and 8. In each of these
schools the contrasting-groups passing score was below the borderline-group
passing score (although the difference was not large in School £). In Schools
1 and 2 the nonmasters outnumbered the masters by at least 2.5 to 1; in both
these schools the contrasting-groups passing score was well above the
borderline-group passing score.

What is the relationship between the passing scores produced by each
method and the students' ability? Table 3 shows the correlations between the
passing scores and the students' mean test score. The correlations have been
computed for schools and for individual teachers. In general, the teachers
whose students were more able tended to set higher passing scores. The single
exception was the case of the Nedelsky passing scores for the reading test,
which correlated negatively with the mean scores of the teachers' students.
In all other cases the correlations were positive, and, in most cases, quite
large. Even the contrasting-groups method, which tends to produce a low
passing score when most students are judged masters, produced higher passing
scores where the students were more able.

How closely did the teachers' judgments correspond to the students' test
scores? Figures 3a and 3b show the means and standard deviations of the test
scores in the groups of students judged "master", "borderline", and
"nonmaster" at each school. The vertical bar for each group extends from one
standard deviation above the group mean to one standard deviation below. The
horizontal line in the center of the bar indicates the mean. The number of
students in each group is shown just below the bar. (The same information is
presented in numerical form in Appendix Tables A2a and A2b, along with the
means and standard deviations of scores for all participating students in each
school.)

The results vary considerably from one school to another. The reading
scores are shown in Figure 3a. They follow the expected pattern, with
reasonably good separation, in Schools 1, 2, 4, and 6. In these schools the
teachers were fairly accurate judges of their students' reading ability. In
Schools 7 and 8 the "masters" clearly scored higher than the other two groups,
but the "nonmasters" scored as high or nearly as high as the "borderline"
students. In School 5 the "borderline" students scored much lower than the
"nonmasters". In School 3 there were no "borderline" students, and the scores
of the "masters" differed very little from those of the "nonmasters".

The math scores are shown in Figure 3b. The scores of the three groups -
"masters", "borderline", and "nonmasters" - follow the expected pattern in all
eight schools, with reasonably good separation of the groups in most of the
schools. In Schools 1, 6, and 7 there was a large overlap between the scores
of the "borderline" students and those of the "nonmasters", and in School 3
all the differences between groups were small. Schools 4 and 5 provide an
interesting comparison. Although the "nonmasters" in School 4 scored higher
than the "masters" in School 5, the teachers' judgments of their students in
each of these schools corresponded quite well to the students' test scores
(which were not available to the teachers at the time they made their
judgments).

Figures 4a and 4b show the mean scores for students judged "master",
"borderline", and "nonmaster" by each teacher. Only the means based on four
or more students are shown. With only one exception, the order of the three
group mean scores for each teacher is as it should be: "masters" highest;
"nonmasters" lowest. However, in some cases there was very little separation
between group means for the same teacher. For example, a teacher's
"borderline" students may have scored only slightly higher than the same
teacher's "nonmasters". The group means seem to imply some striking
differences in standards between teachers in the same school. For example, on
the reading test in School 3, one teacher's "nonmasters" scored higher than
the other two teachers' "masters". A similar situation occurred for the math
test in School 1, where one teacher's "masters" scored about as low as the
other two teachers' "nonmasters". More commonly, one teacher's "borderline"
students would score as high as another teacher's "masters" or as low as
another teacher's "nonmasters".

The information in Figures 3a and 3b provides a good indication of the
extent to which the borderline-group method was working when applied to the
full group of students classified as "borderline" in each school. Ideally,
the test scores of the borderline group should have a small standard deviation
and a mean score between the means for the masters and the nonmasters.
In general, the standard deviation of scores for the borderline group was
large - nearly as large as for the full group of students participating in the
study. On the average (across schools), the borderline group standard
deviation was 87 percent of the total-group standard deviation for the reading
test and 86 percent for the math test. However, in 12 of the 15 cases, the
borderline group mean score was clearly between the mean scores for masters
and nonmasters. The exceptions were Schools 5 and 8 for the reading test and
School 6 for the math test, where the scores of the borderline group were as
low as (or lower than) those of the nonmasters. (The borderline-group method
could not be applied to the reading test in School 3, where no students were
classified as "borderline" in reading.)

One possible reason for the large variation in the scores of the
borderline group might be differences between individual teachers. If so, the
borderline-group method might work better when applied separately to each
individual teacher's own borderline group. Figures 4a and 4b show that for
most teachers, the mean score of the borderline group is clearly between the
mean scores for the masters and the nonmasters. However, Figures 5a and 5b
show that, even for individual teachers, the standard deviation of the
borderline group tends to be nearly as large as the standard deviation of the
scores of all the teacher's students. For the reading teachers, the standard
deviation of the borderline group averages 8.0 score points, which is 83
percent as large as the average total-group standard deviation. For the math
test the corresponding figures are 7.9 score points and 85 percent.

Figures 6a, 6b, 6c, and 6d are graphs of the conditional probability
functions used to determine the contrasting-groups passing scores for each
school.* If the method worked as it should, the graph will show a curve
rising from nearly zero probability at low test score levels to nearly one at
high score levels. The passing score that misclassifies the fewest students
will be the score level at which the probability of being judged a master is
.50. We have used the symbol "X_50" to refer to this score level. The
expression "X_75 - X_25" represents the difference between the test scores that
correspond to a 75 percent probability and a 25 percent probability of being
judged a master. The smaller this difference, the sharper the separation
between the scores of the "masters" and "nonmasters", and the better the
contrasting-groups method is working. Another measure of the extent to which
the test is separating "masters" from "nonmasters" is the slope of the curve
at X_50, which is its steepest point. The steeper the slope, the better the
separation. This slope is also indicated on each graph.

* The method used to estimate these functions is described in the Appendix.
The symbols "A" and "B" at the top of each graph refer to a formula given in
the Appendix.
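To show how the three summary quantities relate to one another, the sketch below assumes, purely for illustration, that the conditional probability curve is logistic. The estimation procedure actually used is described in an Appendix that is not reproduced in this section, so the curve shape, the function names, and the parameters `a` and `b` are assumptions and should not be read as the "A" and "B" of the report.

```python
import math


def logistic(x, a, b):
    """Assumed probability of being judged a master at test score x."""
    return 1.0 / (1.0 + math.exp(-(a + b * x)))


def score_at_probability(p, a, b):
    """Invert the assumed curve: the score at which the probability equals p."""
    return (math.log(p / (1 - p)) - a) / b


def curve_summary(a, b):
    """Return (X_50, X_75 - X_25, slope at X_50) for the assumed logistic curve."""
    x50 = score_at_probability(0.50, a, b)
    spread = score_at_probability(0.75, a, b) - score_at_probability(0.25, a, b)
    slope_at_x50 = b / 4  # derivative b * p * (1 - p) evaluated at p = .50
    return x50, spread, slope_at_x50


# Hypothetical parameters, not estimates from the study.
x50, spread, slope = curve_summary(a=-8.0, b=0.2)
print(x50, spread, slope)
```

Under this assumed shape the spread equals 2 ln 3 divided by `b` and the slope equals `b` divided by 4, so a smaller X_75 - X_25 and a steeper slope are two views of the same thing: sharper separation of the groups.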

Figures 6a and 6b show the curves for the reading test scores, combining
the judgments of the teachers in each school. The method seems to have worked
fairly well in Schools 1, 2, 4, and 6; not as well in Schools 7 and 8; poorly
in School 5, and not at all in School 3. (The problem in School 3 seems to be
the differences between individual teachers' standards, as can be seen in
Figure 4a.) The four schools in which the method worked fairly well were
those in which it yielded the highest standards.

The conditional probability curves for the math test scores, shown in
Figures 6c and 6d, present a very different picture. There were no schools
where the method failed completely. The method worked extremely well in
School 5, where the resulting standard was low, and quite well in School 4,
where the standard was high. It worked nearly as well in School 3, where it
produced the lowest standard, as in School 2, where it produced the highest
standard. In general, the schools where the contrasting-groups method worked
best were not the same for math as for reading.

Table 4 shows the number and the percentage of correct and incorrect
classifications resulting from the use of the contrasting-groups passing score
in each school. (In this case, "correct" means "the same as the teacher's
classification of the student.") Notice that where the overwhelming majority
of the students are masters, the use of the contrasting-groups passing score
will tend to misclassify most of the nonmasters (unless the separation between
the groups is nearly perfect). This situation occurs for the reading test in
Schools 2, 7, and 8, and, to a lesser extent, in School 4. In this case, it
would be possible to obtain a fairly high percentage of correct
classifications by disregarding the test scores and passing all the students.
It is much harder to avoid classification errors when the numbers of masters
and nonmasters are nearly equal, as is the case for math in School 5.

These results suggest a question for further investigation: What would 
happen if the data from the contrasting-groups method were used to set a 
standard based on a rationale similar to that of the other methods? Such a 
standard might be based on the idea of classifying a student into the group -
master or nonmaster - in which the student's test score would be more typical.
It would not depend on the number of students classified into each group, but 
only on their test scores. As a result, it would not minimize the number of 
wrong decisions unless the test-taker population contained equal numbers of 
masters and nonmasters. Therefore, such a standard would not be a wise choice 
as a passing score, except in this special case. But would such a standard 
agree closely with the passing scores set by the borderline-group method? 

To investigate this question, we defined a number which we call "C2",
computed from the contrasting-groups data by the formula

$$\mathrm{C2} = \frac{\bar{x}_m / s_m + \bar{x}_n / s_n}{1 / s_m + 1 / s_n}$$

where $\bar{x}$ represents the mean score, $s$ represents the standard
deviation of scores, and $m$ and $n$ refer to the groups of masters and
nonmasters. This number, used as the basis for classification, places a
student into the group in which the student's test score would be fewer
standard deviations from the mean. For example, if a student's test score
would be 0.7 standard deviations below the mean in the group of "masters" but
0.8 standard deviations above the mean in the group of nonmasters, the
student's score would be above the "C2" standard.
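The C2 computation can be checked numerically. The sketch below uses hypothetical group means and standard deviations, and the function name is ours; it verifies the property stated above, namely that a score equal to C2 lies the same number of standard deviations below the masters' mean as above the nonmasters' mean.

```python
def c2_standard(mean_m, sd_m, mean_n, sd_n):
    """C2: the score equally many standard deviations from both group means."""
    return (mean_m / sd_m + mean_n / sd_n) / (1 / sd_m + 1 / sd_n)


# Hypothetical values (not data from the study).
mean_m, sd_m = 50.0, 10.0   # masters
mean_n, sd_n = 35.0, 5.0    # nonmasters

c2 = c2_standard(mean_m, sd_m, mean_n, sd_n)
below_masters = (mean_m - c2) / sd_m      # SDs below the masters' mean
above_nonmasters = (c2 - mean_n) / sd_n   # SDs above the nonmasters' mean
print(c2, below_masters, above_nonmasters)
```

Because only means and standard deviations enter the formula, the result does not change when one group is much larger than the other.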

Figure 7 shows the relationship between "C2" and the borderline-group
passing score. Each data point represents one school. Obviously, the two
standards agree closely. Not only is the correlation very high, but 14 of the
15 data points are very close to the diagonal line y=x.



Discuss ion 

The purpose of this study was to answer two important questions about the 
four standard-setting methods investigated: 

1. Do the different methods yield similar passing scores?

2. If not, are the differences between methods systematic and
predictable?

The comparison between the borderline-group method and the contrasting-groups
method provided a qualified "yes" answer to the first question and a "yes" to
the second. Where the number of "masters" and the number of "nonmasters" were
similar (i.e., differed by 25 percent or less), the contrasting-groups method
and the borderline-group method yielded similar passing scores. Where the
"masters" greatly outnumbered the "nonmasters", the contrasting-groups method
produced a lower passing score than the borderline-group method, as could be
expected. Conversely, where the "nonmasters" greatly outnumbered the
"masters", the contrasting-groups method produced the higher passing score.

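The direction of this effect can be illustrated with a small Python sketch. It assumes, purely for illustration, that test scores in each group are normally distributed with a common standard deviation; the study itself does not make this assumption, and the function name and all numbers below are hypothetical. Under that assumption, the score at which a student is equally likely to be a master or a nonmaster is the midpoint of the two group means plus sd^2 * ln(n_n / n_m) / (mean_m - mean_n).

```python
import math


def equal_odds_score(mean_m, mean_n, sd, n_m, n_n):
    """Score at which P(master | score) = 0.5 for two equal-SD normal groups."""
    midpoint = (mean_m + mean_n) / 2.0
    # The shift is negative when masters outnumber nonmasters (n_n / n_m < 1).
    shift = sd ** 2 * math.log(n_n / n_m) / (mean_m - mean_n)
    return midpoint + shift


# Hypothetical groups: masters average 50, nonmasters average 35, common SD 9.
balanced = equal_odds_score(50.0, 35.0, 9.0, n_m=100, n_n=100)
mostly_masters = equal_odds_score(50.0, 35.0, 9.0, n_m=300, n_n=100)
mostly_nonmasters = equal_odds_score(50.0, 35.0, 9.0, n_m=100, n_n=300)

print(balanced, round(mostly_masters, 1), round(mostly_nonmasters, 1))
```

With equal group sizes the cut point sits midway between the group means; it moves down when masters predominate and up when nonmasters predominate, which is the pattern described above.
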
The comparisons involving the Nedelsky and Angoff methods were far less 
predictable. Each of these methods was applied at four schools, and for both 
methods the results varied from school to school. 

The results of this study reflected a general tendency for teachers of
higher-ability students to set higher standards. (The single exception was
the application of the Nedelsky method to the reading test.) This finding was
not an artifact of the methods. There was no suggestion to the teachers of
any limit on the number of students to be classified as masters or nonmasters.
Indeed, some teachers did classify all or nearly all of their students into
the same group. There was no suggestion of a relative standard in the verbal
definition given to the teachers. The standard was to represent the minimum
level of reading/math skill necessary to function adequately as an adult in
American society. Possibly the teachers at the schools with more able
students envision a different type of adult life for their students than do
the teachers at the schools where students are less able.

One surprising finding was the large variation in the test scores of the
students classified as "borderline". This group often included some of the
highest and lowest scoring students, despite the availability of a "cannot
judge" category for students whose skills the teachers did not feel confident
in judging. The large variation in the scores of the borderline group
occurred for individual teachers as well as for schools.

What accounts for the presence of many high and low scoring students in
the borderline group? More generally, why were the teachers' judgments of
their students' skills often inconsistent with the students' test scores? One
possibility is that the teachers may have been unaware of the skills (or lack
of skills) of some of their students. The skills tested were those taught in
elementary school; junior high school teachers may not observe these skills
systematically. Another possibility is that the teachers may have based their
judgments on reading or math skills other than those measured by the test, or
on only a subset of the skills measured by the tests. The teachers might not
have agreed completely with the developers of the tests as to which specific
reading and math skills are necessary to function adequately as an adult. A
third possibility is that the test scores may simply have been invalid for
some students. Some students may not have made a genuine effort; some may
have copied from their neighbors' papers. Taken together, these three
explanations may account for the inconsistencies between test scores and
teachers' judgments. And in most cases, despite these inconsistencies, the
agreement between test scores and judgments was good enough to provide a
fairly clear choice of a passing score, by either the contrasting-groups
method or the borderline-group method.

Possibly the most important finding of our study is that the results of
the Nedelsky and Angoff methods were generally not consistent with those of
the borderline-group method. The Nedelsky and Angoff judgments did not
accurately reflect the actual test performance of real students classified as
"borderline". But were the results of the borderline-group method valid? In
12 of 15 cases, the borderline group mean score was clearly between the mean
scores for masters and nonmasters. In the other three cases, the test scores
of the borderline group were close to or below those of the nonmasters, but in
two of these three cases the Angoff passing score was even lower than the
borderline-group passing score. Any correction applied to the
borderline-group passing score would have moved it even farther from the
Angoff passing score.

This finding leads us to suggest that those who set passing scores use
methods based on the test scores of real test-takers whenever possible. Those
who use the Nedelsky and Angoff methods might consider a modification that
allows the judges to revise their judgments on the basis of actual student
response data from the test.

The results of this study might have been different if the teachers at
each school had been required to agree on a precise verbal definition of the
standard in behavioral terms before judging their students or the test
questions. If the teachers had specified their own standard in terms of the
specific reading or math tasks that distinguish "masters" from "nonmasters",
the teachers' judgments of their students might have been more consistent with
their judgments of the test questions. Unfortunately, the teachers' time
available for this study was not sufficient to allow for such a step in the
procedure. This step could be the missing link that provides for consistency
between standard-setting methods based on judgments about students and methods
based on judgments about test questions.

Table 1: Design of the Study

          Community    Grade Level
School*   Type         of Students   First                   Second

   1      Urban        6, 7, 8       Questions (Nedelsky)    Students
   2      Non-urban    7, 8          Questions (Nedelsky)    Students
   3      Urban        7, 8, 9       Students                Questions (Nedelsky)
   4      Non-urban    8             Students                Questions (Nedelsky)
   5      Urban        6, 7, 8       Questions (Angoff)      Students
   6      Urban        7, 8          Questions (Angoff)      Students
   7      Urban        7, 8          Students                Questions (Angoff)
   8      Non-urban    7, 8          Students                Questions (Angoff)

*The numbering of the schools is arbitrary and does not correspond to the
order in which data were collected.

Table 2: Failure rates resulting from passing score
         determined by each method in each school.

Reading Test
             Contrasting   Borderline
School       Groups        Group         Nedelsky    Angoff

   1           .44           .51           .31
   2           .05           .26           .04
   3           --                          .68
   4           .11           .31           .05
   5           .30           .24                       .11
   6           .46           .46                       .18
   7           .03           .30                       .36
   8           .04           .21                       .07

Math Test
             Contrasting   Borderline
School       Groups        Group         Nedelsky    Angoff

   1           .89           .59           .39
   2           .82           .65           .20
   3           .08           .45           .91
   4           .28           .33           .24
   5           .55           .45                       .55
   6           .64           .37                       .77
   7           .42           .38                       .47
   8           .20           .41                       .76

Table 3: Correlations between passing score and
         students' mean test scores

                          Reading Test            Math Test
Passing score           Schools   Teachers     Schools   Teachers

Contrasting groups        .33       .43          .64       .65
Borderline group          .91       .70          .93       .84
Nedelsky                   a       -.33           a        .70
Angoff                     a        .61           a        .85

a Correlations based on fewer than eight observations were not computed.

NOTE: The four columns of this table correspond, respectively, to Figure 1a,
Figures 2a and 2b, Figure 1b, and Figures 2c and 2d.
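
The school-level correlations in the math column can be checked with a short Python sketch of the ordinary Pearson correlation. The function name `pearson_r` is hypothetical. The eight school means and passing scores are the math-test values as they can be read from the appendix tables of this report, so the result is only as reliable as that reading.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Math test, Schools 1-8 (values read from the appendix tables).
mean_scores = [28.6, 39.0, 22.7, 50.9, 24.7, 28.3, 34.9, 39.4]
borderline_pass = [29.0, 45.0, 22.0, 46.0, 23.0, 25.0, 31.0, 35.5]
contrasting_pass = [41.4, 52.3, 15.3, 43.7, 24.8, 30.8, 31.6, 27.6]

r_borderline = pearson_r(mean_scores, borderline_pass)
r_contrasting = pearson_r(mean_scores, contrasting_pass)
print(round(r_borderline, 2), round(r_contrasting, 2))
```

Both values come out close to the .93 and .64 reported for schools on the math test.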

Table 4: Correct and incorrect classifications resulting from
         contrasting groups method.

                 Number of students          Proportion correctly classified    Slope of
              "Masters"    "Nonmasters"                                         conditional
                                                                                probability
School       Pass   Fail    Pass   Fail    "Masters"  "Nonmasters"  Combined    curve

Reading Test
   1           97     18      14     84       .84        .86          .85        1/20
   2          134      3      12      4       .98        .25          .90        1/26
   3           --     --      --     --       --         --           --         0
   4          123      6      15     10       .95        .40          .86        1/17
   5           44      3      17      4       .94        .19          .71        1/58
   6          113     17      16     86       .87        .84          .86        1/20
   7          213      5      25      4       .98        .14          .88        1/39
   8          220      6      37      4       .97        .10          .84        1/41

Math Test
   1           10     26       8    120       .28        .94          .79        1/34
   2           33     18      10    220       .65        .96          .90        1/21
   3           70      3      27      2       .96        .07          .71        1/27
   4          114     13      14     32       .90        .70          .84        1/18
   5           63     11      10     68       .85        .87          .86        1/12
   6           81     64      37    136       .56        .79          .68        1/39
   7          101     29      33     59       .78        .64          .72        1/29
   8          150     16      28     28       .90        .50          .80        1/31
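
The proportions in this table follow directly from the four counts in each row. The Python sketch below shows the arithmetic; the function name `classification_summary` is hypothetical, and the example uses the reading-test counts for Schools 1 and 2.

```python
def classification_summary(m_pass, m_fail, n_pass, n_fail):
    """Proportions correctly classified: masters who pass, nonmasters who fail, combined."""
    masters_correct = m_pass / (m_pass + m_fail)
    nonmasters_correct = n_fail / (n_pass + n_fail)
    combined = (m_pass + n_fail) / (m_pass + m_fail + n_pass + n_fail)
    return masters_correct, nonmasters_correct, combined


# Reading test, School 1: masters 97 pass / 18 fail, nonmasters 14 pass / 84 fail.
school_1 = classification_summary(97, 18, 14, 84)
# Reading test, School 2: masters 134 pass / 3 fail, nonmasters 12 pass / 4 fail.
school_2 = classification_summary(134, 3, 12, 4)

print([round(v, 2) for v in school_1])  # [0.84, 0.86, 0.85]
print([round(v, 2) for v in school_2])  # [0.98, 0.25, 0.9]
```

School 2 shows how a high combined proportion can conceal poor classification of the smaller group: only 4 of its 16 "nonmasters" fall below the passing score.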

Figure 1a. Passing scores for the reading test, computed for each school:
C = Contrasting-groups; B = Borderline-group; N = Nedelsky;
A = Angoff.

Figure 1b. Passing scores for the math test, computed for each school:
C = Contrasting-groups; B = Borderline-group; N = Nedelsky;
A = Angoff.

Figure 2a. Passing scores for the reading test, computed for individual
teachers in Schools 1-4.
(C = Contrasting-groups; B = Borderline-group; N = Nedelsky)

Figure 2b. Passing scores for the reading test, computed for individual
teachers in Schools 5-8.
(C = Contrasting-groups; B = Borderline-group; A = Angoff)

Figure 2c. Passing scores for the math test, computed for individual
teachers in Schools 1-4.
(C = Contrasting-groups; B = Borderline-group; N = Nedelsky)

Figure 2d. Passing scores for the math test, computed for individual
teachers in Schools 5-8.
(C = Contrasting-groups; B = Borderline-group; A = Angoff)

Means and standard deviations of reading test scores of
students judged "Master", "Borderline", and "Nonmaster"
in each school.

Means and standard deviations of mathematics test scores
of students judged "Master", "Borderline", and "Nonmaster"
in each school.

Figure 4b. Mean math test scores of students judged "master", "borderline",
and "nonmaster" by each teacher.

Figure 5a. Distribution of standard deviations of test scores
for borderline group and for all students: reading test.

Figure 5b. Distribution of standard deviations of test scores
for borderline group and for all students: math test.

Figure 6a. Estimated relationship between reading test scores and
probability of mastery judgment: Schools 1-4.

Figure 6b. Estimated relationship between reading test scores and
probability of mastery judgment: Schools 5-8.

Figure 6d. Estimated relationship between math test scores and
probability of mastery judgment: Schools 5-8.

Figure 7. Borderline-group passing score and "C2" standard for each school.

Appendix

Computing Passing Scores from Contrasting-Groups Data.

The procedure used in this study to compute passing scores from
contrasting-groups data is based on decision theory. This approach
requires, for each test score level, an estimate of the probability that a
student with that test score would be judged a master (given that the
student would be judged either a master or a nonmaster). If the number of
students at each score level were very large, it would make sense to use
the percentage of masters at each score level as a probability estimate.
But, because the number of students at each score level is small, it is
necessary to use some other means of estimating the conditional probability
function. The method used in this study is called "logistic regression".
It assumes that the conditional probability function can be described by
the equation

\[
P = \frac{1}{1 + e^{-(a + bx)}}
\]

where e is the familiar constant 2.71828, x is the student's test score,
and a and b are parameters estimated from the data (in this case, by using
the BMDP statistical software package). Once the a and b parameters have
been estimated, it is possible to find the value of x (the test score)
corresponding to any desired value of P (probability of being judged a
master). In particular, when P = .50, then x = -a/b. We have referred to
this score as "X50". The slope of the conditional probability curve is
steepest at this point, where it is equal to b/4.
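
The study estimated a and b with BMDP. As a rough stand-in, the Python sketch below fits the same two-parameter logistic curve by Newton-Raphson maximum likelihood and then reads off the score at which P = .50 (that is, -a/b) and the maximum slope b/4. The function names and the small data set are hypothetical and are not taken from the study.

```python
import math


def logistic(a, b, x):
    """Probability of a mastery judgment at test score x."""
    return 1.0 / (1.0 + math.exp(-(a + b * x)))


def fit_logistic(scores, judged_master, iterations=25):
    """Newton-Raphson maximum-likelihood estimates of a and b.

    Scores are centered internally for numerical stability; the returned
    a and b are expressed on the original score scale.
    """
    center = sum(scores) / len(scores)
    a_c, b = 0.0, 0.0
    for _ in range(iterations):
        g_a = g_b = h_aa = h_ab = h_bb = 0.0
        for x, y in zip(scores, judged_master):
            d = x - center
            p = logistic(a_c, b, d)
            w = p * (1.0 - p)
            g_a += y - p            # gradient of the log-likelihood
            g_b += (y - p) * d
            h_aa += w               # negative Hessian (information matrix)
            h_ab += w * d
            h_bb += w * d * d
        det = h_aa * h_bb - h_ab * h_ab
        a_c += (h_bb * g_a - h_ab * g_b) / det
        b += (h_aa * g_b - h_ab * g_a) / det
    return a_c - b * center, b


# Hypothetical data: 10 students at each of 16 score levels (30, 32, ..., 60),
# with the number judged "master" rising with the score.
levels = range(30, 61, 2)
masters_per_level = [0, 1, 1, 1, 2, 3, 4, 5, 5, 6, 7, 8, 9, 9, 9, 10]
scores, judged_master = [], []
for x, m in zip(levels, masters_per_level):
    scores += [x] * 10
    judged_master += [1] * m + [0] * (10 - m)

a, b = fit_logistic(scores, judged_master)
x50 = -a / b           # score at which P = .50
max_slope = b / 4.0    # slope of the curve at that score
print(round(x50, 1), round(max_slope, 3))
```

Because the hypothetical data are symmetric about a score of 45, the fitted passing score falls at 45; with real judgments it would fall wherever the fitted curve crosses P = .50.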

Stability of the Contrasting-Groups Passing Score.

The sampling variability of X50 depends on the slope and the sample
size. The larger the sample and the steeper the slope, the more precisely
X50 can be determined. When the sample is not too small, it is possible to
estimate the standard error of X50 by the formula

\[
\mathrm{Var}(X_{50}) = \frac{1}{b^2}\,\mathrm{Var}(a) + \frac{X_{50}^2}{b^2}\,\mathrm{Var}(b) + \frac{2 X_{50}}{b^2}\,\mathrm{Cov}(a,b) .
\]

The standard error is the square root of this variance. The variances and
covariance of the a and b parameters are computed by BMDP.
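
A minimal Python sketch of this calculation follows. It assumes the usual first-order (delta-method) approximation for the score -a/b, which gives Var = [Var(a) + X50^2 Var(b) + 2 X50 Cov(a,b)] / b^2. The function name and the parameter values are hypothetical and are not estimates from the study.

```python
import math


def x50_standard_error(a, b, var_a, var_b, cov_ab):
    """Delta-method standard error of X50 = -a/b."""
    x50 = -a / b
    variance = (var_a + x50 ** 2 * var_b + 2.0 * x50 * cov_ab) / b ** 2
    return math.sqrt(variance)


# Hypothetical estimates: X50 = 45 and a maximum slope of b/4 = 1/20.
a, b = -9.0, 0.2
var_a, var_b, cov_ab = 0.90, 0.0004, -0.018

se = x50_standard_error(a, b, var_a, var_b, cov_ab)
print(round(se, 2))  # 1.5
```

Estimates of a and b from a logistic regression are typically strongly negatively correlated, so the covariance term offsets much of the first two terms; dropping it would badly overstate the standard error.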

The standard errors of the contrasting-groups passing scores computed
for each school as a whole were as follows:

              Reading (65 points)    Math (70 points)

School 1            1.0                   2.6
       2            3.2                   1.2
       3        not computed              2.6
       4            1.7                   1.1
       5            6.2                   0.7
       6            1.0                   1.2
       7            3.7                   1.2
       8            3.7                   1.6

These standard errors do not include the selection of individual teachers
as a source of variability. They do include any unreliability in the
individual teachers' judgments and in the students' test scores, as well as
the selection of individual students. Thus, the standard errors refer to
replications of the procedure with the same teachers but different
students.

The standard errors of the contrasting-groups passing score computed
for individual teachers varied from 1.6 to 7.9 points for the reading test
and from 1.4 to 5.1 points for the math test.

Stability of the Borderline-Group Passing Score.

Although the borderline group often contained students with very high
and very low scores, its median could be quite stably estimated when the
method was applied to the school as a whole. There is no single simple
formula for the standard error of the median; its standard error depends on
the parent distribution, which is unknown. However, the standard error of
the mean should provide a reasonable approximation. The standard errors of
the mean of the scores of the borderline group in each school were as
follows:

              Reading (65 points)    Math (70 points)

School 1            0.9                   1.0
       2            1.2                   1.1
       3        not computed              0.8
       4            0.8                   1.5
       5            1.4                   0.7
       6            0.9                   0.9
       7            1.4                   1.8
       8            1.9                   1.4

These standard errors refer to replications of the procedure with the same
teachers but different students.
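
The standard error of the mean used here as an approximation is simply the group's standard deviation divided by the square root of the group size. The Python sketch below applies it to the reading-test borderline groups at Schools 2 and 4, using the standard deviations and group sizes given in the appendix table of reading scores by school and judgment; the function name is hypothetical.

```python
import math


def standard_error_of_mean(sd, n):
    """Standard error of a group mean: SD divided by the square root of N."""
    return sd / math.sqrt(n)


# Reading test, borderline groups (SD, N) from the appendix table.
school_2 = standard_error_of_mean(6.5, 31)
school_4 = standard_error_of_mean(5.5, 49)

print(round(school_2, 1), round(school_4, 1))  # 1.2 0.8
```

Both results agree with the reading-test values listed above for those two schools.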

Table A1: Passing scores and mean and standard deviation of students' test scores

                          Passing score                        Students' scores
           Contrasting-  Borderline-
School     groups        group        Nedelsky   Angoff         Mean      SD

Reading
[The rows for Schools 1-6 are not legible in the source.]
   7                                             [illegible]    47.0     10.5
   8         34.4          48.5         --       [illegible]    53.8      8.5

Math
   1         41.4          29           25.0                    28.6      9.4
   2         52.3          45           27.7                    39.0     12.5
   3         15.3          22           29.8                    22.7      6.6
   4         43.7          46           42.2                    50.9     10.9
   5         24.8          23                      24.6         24.7      8.9
   6         30.8          25                      34.7 c       28.3      9.7
   7         31.6          31                      33.9         34.9     10.6
   8         27.6          35.5                    48.6         39.4     11.9

a Could not be computed.
b Does not include one teacher who was unable to attend standard setting session.
c Does not include two teachers who were unable to attend standard setting session.

Table A2a. Means and Standard Deviations of Reading Test Scores,
           by School and Judgment

                     All
School             Students   "Master"   "Borderline"   "Nonmaster"

   1     Mean        45.5       52.8         46.2          36.5
         SD          11.1        9.3          9.7           7.6
         N           329        115          112

   2     Mean        54.0       56.4         48.8          43.4
         SD           8.3        7.1          6.5           9.1
         N           186        137           31            16

   3     Mean        39.1       39.8          --           38.5
         SD           8.9        8.5                        9.2
         N           146         66            0        [illegible]

   4     Mean        58.4       60.6         57.3          49.3
         SD           7.1        4.1          5.5          12.4
         N           204        129           49            25

   5     Mean        39.8       47.9         32.5          42.0
         SD          11.8        8.6          9.1           9.9
         N           124         47           44            21

   6     Mean        44.1       53.7         42.8          33.8
         SD          12.9        8.0         11.3          11.3
         N           387        130          152           102

   7     Mean        47.0       49.4         40.8          37.0
         SD          10.5        9.6          7.6          11.2
         N           288        218           30            29

   8     Mean        53.8       55.4         46.6          47.3
         SD           8.5        7.6          7.3           8.8
         N           284        226           14            41

Table A2b. Means and Standard Deviations of Math Test Scores,
           by School and Judgment

                     All
School             Students   "Master"   "Borderline"   "Nonmaster"

   1     Mean        28.6       36.5         29.8          25.9
         SD           9.4         ?            ?             ?
         N           229         36           65           128

   2     Mean        39.0       53.7         44.5          33.5
         SD            ?          ?            ?             ?
         N           376         51           95           230

   3     Mean        22.7       24.1         22.3          20.0
         SD           6.6         ?            ?             ?
         N           157         73           54            29

   4     Mean        50.9       56.7         45.5          39.7
         SD          10.9         ?            ?             ?
         N           209        127           34            46

   5     Mean        24.7       32.6         24.0          18.9
         SD           8.9        8.7          6.9            ?
         N           266         74          106            78

   6     Mean        28.3       33.5         25.9          25.2
         SD           9.7        9.4          8.4           8.8
         N           413        145           94           173

   7     Mean        34.9       40.0         32.7          29.2
         SD          10.6        9.7         10.4           7.8
         N           266        130           35            92

   8     Mean        39.4       43.6         36.0          30.0
         SD          11.9       11.4         10.2           8.3
         N           275        166           52            56

[? = value not legible in the source.]
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